P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This analysis is carried out only for the red wine data.
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## [1] 1599 13
There are 1599 observations and 13 columns. The column “X” is only a row index and can be deleted, which leaves 12 variables. Now let’s look at the first and last few observations and the structure of the data.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1594 6.8 0.620 0.08 1.9 0.068
## 1595 6.2 0.600 0.08 2.0 0.090
## 1596 5.9 0.550 0.10 2.2 0.062
## 1597 6.3 0.510 0.13 2.3 0.076
## 1598 5.9 0.645 0.12 2.0 0.075
## 1599 6.0 0.310 0.47 3.6 0.067
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1594 28 38 0.99651 3.42 0.82
## 1595 32 44 0.99490 3.45 0.58
## 1596 39 51 0.99512 3.52 0.76
## 1597 29 40 0.99574 3.42 0.75
## 1598 32 44 0.99547 3.57 0.71
## 1599 18 42 0.99549 3.39 0.66
## alcohol quality
## 1594 9.5 6
## 1595 10.5 5
## 1596 11.2 6
## 1597 11.0 6
## 1598 10.2 5
## 1599 11.0 6
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
All variables are numeric except wine quality, which is an integer. There are no missing values in the data set. Now, let’s look at the summary of the data.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
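The loading and inspection steps so far can be sketched in R roughly as follows. This is a minimal sketch rather than the original code: to stay self-contained it reads six rows copied from the `head()` output above as inline text, whereas the real analysis would read the full CSV file (its path is not given in this report, so the placeholder in the comment is an assumption).

```r
# Six rows copied from the head() output above, with the index column "X"
# added back. In the real analysis this inline text would be replaced by the
# full file, e.g. read.csv("<path to the red wine CSV>") (path is hypothetical).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
5,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
6,7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5"

redwine <- read.csv(text = csv_text)

# "X" is only a row index, so drop it.
redwine$X <- NULL

dim(redwine)              # rows and columns after dropping X
str(redwine)              # column types
colSums(is.na(redwine))   # number of missing values per column
summary(redwine)          # min, quartiles, mean and max per column
```

On the full file the same calls should reproduce the dimension, structure and summary listings shown above.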
Most of the wine samples are of average quality, with ratings of 5 and 6. The quality variable can be treated as an ordered categorical variable: the rating scale runs from 0 (very bad) to 10 (very excellent), although only ratings 3 to 8 occur in this data. A new categorical variable named “fquality” is created as shown below.
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ fquality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## # A tibble: 6 x 3
## fquality n proportion
## <ord> <int> <dbl>
## 1 3 10 0.01
## 2 4 53 0.03
## 3 5 681 0.43
## 4 6 638 0.40
## 5 7 199 0.12
## 6 8 18 0.01
About 82% of the wines are of average quality [rating 5 and 6]. About 4% are of the worst quality [rating 3 and 4] and about 14% are of better quality [rating 7 and 8]. There are few wines of the best and worst quality.
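The ordered factor and the frequency table above can be reproduced with a few lines of base R. The sketch below rebuilds the ratings from the counts in the table instead of reading the data file; the original output is a tibble, so the author probably used a different (dplyr-style) call, and the three band labels are illustrative names, not part of the data set.

```r
# Ratings rebuilt from the frequency table above (counts for ratings 3 to 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Ordered factor: only the ratings that actually occur become levels.
fquality <- factor(quality, ordered = TRUE)

counts <- table(fquality)
proportion <- round(prop.table(counts), 2)
data.frame(fquality = names(counts),
           n = as.integer(counts),
           proportion = as.numeric(proportion))

# Coarser grouping into three bands (labels are illustrative).
band <- cut(quality, breaks = c(2, 4, 6, 8),
            labels = c("low (3-4)", "average (5-6)", "high (7-8)"))
band_share <- round(100 * prop.table(table(band)), 1)
band_share
```

Computing the band shares from the raw counts, rather than summing the rounded proportions, gives roughly 3.9%, 82.5% and 13.6%.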
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
Most of the wine samples have a fixed acidity between 6 and 11 g/dm^3. The histogram is right skewed, with many outliers on the higher side. Because of the long-tailed distribution, the mean (8.32) of the samples is greater than the median (7.9). The median and mean fixed acidity are slightly higher for wines with quality ratings of 7 and 8.
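Each variable in this section is reported as an overall summary followed by one summary per quality level. A small helper along the following lines would produce that two-part output. The function name `summarise_by_quality` is hypothetical, and the ten values are the first ten rows from the `str()` output, so the numbers will not match the full-data listing.

```r
# Hypothetical helper: overall summary plus one summary per quality level.
summarise_by_quality <- function(x, group) {
  list(summary(x), by(x, group, summary))
}

# First ten rows, copied from the str() output earlier in this report.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
fquality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), ordered = TRUE)

result <- summarise_by_quality(fixed_acidity, fquality)
result
```

Passing a different column as `x` gives the corresponding listing for any other variable.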
We log-transform the data to check whether it then follows a normal distribution.
The log-transformed data is fairly normal, with outliers on both sides of the distribution. The peak of the data occurs at around 7 g/dm^3.
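To illustrate why a log transformation helps with a right-skewed variable, the sketch below compares a moment-based skewness measure before and after taking logs. The `skewness` function is one common definition written out by hand, and the base-10 logarithm is an assumption (the report does not say which base was used). The ten values come from the `str()` output, so the effect is only indicative.

```r
# Moment-based sample skewness (one common definition, written out by hand).
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# First ten fixed-acidity values from the str() output earlier in this report.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# A mean above the median hints at a right-skewed sample.
mean(fixed_acidity) > median(fixed_acidity)

skew_raw <- skewness(fixed_acidity)
skew_log <- skewness(log10(fixed_acidity))
c(raw = skew_raw, log10 = skew_log)

# hist(log10(fixed_acidity)) would draw the transformed histogram.
```

Because the logarithm compresses large values more than small ones, the skewness of the transformed sample should come out lower than that of the raw sample, although on ten rows the difference is modest.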
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
The distribution of volatile acidity is right (positively) skewed. As the quality of the wine increases, the mean and median volatile acidity generally decrease. As noted in the variable description, too high a level of volatile acidity can lead to an unpleasant, vinegar taste. Let’s transform the data to check for a normal distribution.
We can see that most of the samples lie in the range 0.3 to 0.8. The best quality wines have a volatile acidity of around 0.37 to 0.4. The distribution is fairly normal, with a few outliers on both sides.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Citric acid is used for adding flavour and ‘freshness’. It has a long-tailed, positively skewed distribution with multiple modes. Most of the observations fall between 0 and 0.5. The best quality wines have higher citric acid levels (mean and median) of around 0.4. It will be interesting to see the bivariate relationship with wine quality.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
The distribution is highly right skewed, with the peak occurring at around 2 g/dm^3. Most of the samples have residual sugar between about 1 and 3 g/dm^3. Residual sugar is more or less constant across the different wine qualities.
Even after the log transformation the data remains non-normal and positively skewed.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
The distribution of chlorides is highly skewed, with many outliers on the higher side. The median chloride level is lower for wines with quality ratings of 7 and 8.
The log-transformed distribution is fairly normal, with outliers on both sides.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
Free sulfur dioxide has a long-tailed, positively skewed distribution with outliers on the higher side. The range of values is also wide. The average quality wines have slightly higher levels of free sulfur dioxide, which prevents microbial growth and the oxidation of wine.
The log distribution is fairly normal.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
The distribution is highly skewed with a huge range, and there are outliers on the higher side. The mean total sulfur dioxide is lower for wines with ratings of 7 and 8 than for the average quality wines (ratings 5 and 6). There are two data points on the extreme right that need further investigation.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1080 7.9 0.3 0.68 8.3 0.05
## 1082 7.9 0.3 0.68 8.3 0.05
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1080 37.5 278 0.99316 3.01 0.51
## 1082 37.5 289 0.99316 3.01 0.51
## alcohol quality fquality
## 1080 12.3 7 7
## 1082 12.3 7 7
We can see that all feature values are the same for the two rows except total sulfur dioxide, which is unusually high. This could be a copy-paste or typing error. We can delete these two extreme observations from the data set and check the distribution of total sulfur dioxide again.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.17 62.00 165.00
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 17.0 27.0 32.5 43.0 106.0
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
The distribution looks fairly normal with no outliers.
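The outlier check and removal described for total sulfur dioxide can be sketched as follows. The small data frame combines the first six rows with the two extreme rows (1080 and 1082) shown above. The cut-off of 200 is a hypothetical threshold chosen only because it separates those two rows from the next highest value in the data (165); the original analysis may have selected the rows differently.

```r
# Illustrative frame: total sulfur dioxide for the first six rows plus the
# two extreme rows (1080 and 1082) listed above.
redwine <- data.frame(
  row_id = c(1, 2, 3, 4, 5, 6, 1080, 1082),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 278, 289),
  quality = c(5, 5, 5, 6, 5, 5, 7, 7)
)

# Hypothetical cut-off: values above 200 are treated as extreme here.
cutoff <- 200
extreme <- redwine$total.sulfur.dioxide > cutoff

redwine[extreme, ]                      # inspect the rows before deleting them

redwine_clean <- redwine[!extreme, ]    # keep everything else
summary(redwine_clean$total.sulfur.dioxide)
```

Inspecting the flagged rows before dropping them keeps the decision visible, which matters here because the two rows might be valid extreme cases rather than data-entry errors.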
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9968 0.9978 1.0037
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9959 0.9961 0.9974 1.0032
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
The distribution of density is normal, with the mean, median and mode occurring at around 0.997. The density is almost constant across the different wine qualities.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.290 3.294 3.380 3.780
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
The distribution of pH is fairly normal, with the mean, median and mode occurring at around 3.3. There are outliers on both sides of the distribution. pH is slightly lower for wines with quality ratings of 7 and 8.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6583 0.7300 2.0000
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7436 0.8300 1.3600
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
The distribution of sulphates is long tailed and positively skewed, with the peak occurring at around 0.6. There are many outliers in the data set. Sulphate levels are slightly higher for higher quality wines, so sulphates might have an impact on wine quality.
The distribution looks fairly normal with outliers on right side.
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
##
## [[2]]
## redwine$fquality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## redwine$fquality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## redwine$fquality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## redwine$fquality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## redwine$fquality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.46 12.10 14.00
## --------------------------------------------------------
## redwine$fquality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
The distribution of alcohol is long tailed and positively skewed, with a few outliers on the higher side. Most of the values lie between 9 and 12.
We can also see that the mean and median alcohol are higher for wines with quality ratings of 7 and 8 and lower for low quality wines. Apart from a dip at rating 5, the mean and median rise steadily with quality. It looks like alcohol has a strong influence on the quality of the wine. It will be interesting to see the effect of alcohol combined with other features of the data set. Let’s see the log distribution below.
The distribution looks fairly normal, with the peak occurring at around 9.5.
There are 1599 observations of 12 variables. All variables are numeric except the output variable, quality, which is an integer. The data is tidy, with no missing values.
Alcohol, volatile acidity, fixed acidity, citric acid and pH are the main features of the dataset.
Residual sugar, sulphates and total sulfur dioxide are other features of interest.
A new ordered categorical variable, fquality, was created from the integer quality variable.
In total sulfur dioxide there are two outliers with extreme levels; their values for all other variables are identical. I decided to delete these rows, assuming they are bad data. They may instead be valid extreme cases, but deleting two rows is unlikely to have any impact on the analysis. Since the data is tidy, no other operations were performed.
Fixed acidity is positively correlated with citric acid and density, and negatively correlated with pH. These are moderate correlations, with coefficients of around 0.7 in absolute value.
Volatile acidity is negatively correlated with citric acid.
Citric acid is negatively correlated with pH.
Residual sugar has no correlation with any other feature.
Free sulfur dioxide has a moderate correlation with total sulfur dioxide.
Density is negatively correlated with alcohol.
Alcohol has a positive correlation with the quality of wines.
Now let’s look at the scatter plots and corresponding correlation coefficients of the features of importance and interest mentioned above.
## [1] "Coorelation coefficient between: alcohol and quality is 0.47"
There is a weak positive correlation between alcohol and the quality of wines.
## [1] "Coorelation coefficient between: volatile.acidity and quality is -0.39"
There is a weak negative correlation between volatile acidity and the quality of wines.
## [1] "Coorelation coefficient between: fixed.acidity and citric.acid is 0.67"
Fixed acidity and citric acid have a moderate positive correlation.
## [1] "Coorelation coefficient between: fixed.acidity and density is 0.67"
Again, fixed acidity has a moderate positive correlation with the density of wines.
## [1] "Coorelation coefficient between: fixed.acidity and pH is -0.69"
Fixed acidity has a moderate negative correlation with the pH of wines.
## [1] "Coorelation coefficient between: citric.acid and volatile.acidity is -0.55"
Citric acid and volatile acidity are negatively correlated. The relationship between them is weak.
## [1] "Coorelation coefficient between: free.sulfur.dioxide and total.sulfur.dioxide is 0.67"
Since free sulfur dioxide is part of total sulfur dioxide, this relationship is as expected [moderate and positive correlation].
## [1] "Coorelation coefficient between: density and alcohol is -0.49"
Density and alcohol have a weak negative correlation.
## [1] "Coorelation coefficient between: citric.acid and pH is -0.54"
Similarly, citric acid and pH have a weak negative correlation.
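The coefficient messages above could be generated by a small helper like the one sketched here. The name `report_cor` is hypothetical, and Pearson correlation (the default of `cor()`) is an assumption, since the report does not state the method. The data frame holds only the first ten rows from the `str()` output, so the coefficients it prints will differ from the full-data values quoted above.

```r
# Hypothetical helper that builds a message in the style of the output above.
report_cor <- function(data, x, y) {
  r <- cor(data[[x]], data[[y]])   # Pearson correlation by default
  paste("Correlation coefficient between:", x, "and", y, "is", round(r, 2))
}

# First ten rows, copied from the str() output earlier in this report.
redwine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

report_cor(redwine, "fixed.acidity", "pH")
report_cor(redwine, "alcohol", "quality")
```

Even on these ten rows the sign for fixed acidity and pH is negative, in line with the full-data coefficient of -0.69.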
Wines with quality ratings of 7 and 8 have median alcohol levels above 11. There is also a slight trend: wine quality increases as the alcohol level increases. This can also be observed in the scatter plot of these features.
Wines with lower volatile acidity tend to be of better quality. It will be interesting to see the combined effect of volatile acidity and alcohol in the multivariate analysis.
If we look at the medians of the box plots we can see a clear trend of wine quality with citric acid: quality increases as the level of citric acid increases.
Sulphates also have a positive relationship with the quality of wines.
Better quality wines have slightly lower levels of pH.
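The box plot comparisons discussed above can be sketched with base R graphics. The rows below are the alcohol and quality values from the `head()`, `str()` and `tail()` listings (rows 1 to 10 and 1594 to 1599), so only ratings 5 to 7 appear. The original plots were probably drawn with a different plotting library, and the axis labels are illustrative.

```r
# Alcohol and quality for rows 1-10 and 1594-1599, copied from the listings
# earlier in this report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5,
             9.5, 10.5, 11.2, 11.0, 10.2, 11.0)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5,
             6, 5, 6, 6, 5, 6)
wine <- data.frame(alcohol = alcohol,
                   fquality = factor(quality, ordered = TRUE))

# One box per quality level; boxplot() also returns the plotted statistics.
bp <- boxplot(alcohol ~ fquality, data = wine,
              xlab = "Quality rating", ylab = "Alcohol (%)")

# Group medians, i.e. the centre lines of the boxes.
medians <- tapply(wine$alcohol, wine$fquality, median)
medians
```

Swapping `alcohol` for another column gives the corresponding box plot for volatile acidity, citric acid, sulphates or pH.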
No
It cannot be called the strongest, but there is a moderate relationship among the features mentioned in the correlation observations above.
We can see a clear pattern emerging from the plot. The light orange dots corresponding to quality ratings of 5 and 6 are concentrated towards the left middle portion of the plot. This is the region of higher volatile acidity and lower alcohol levels. These are medium quality wines [quality rating 5 and 6]. The blue and green dots corresponding to quality ratings of 7 and 8 are concentrated towards the lower middle portion of the graph. This is the region where alcohol is higher and volatile acidity is lower. This plot helps us differentiate average quality wines from better quality wines.
This is a different representation of the multivariate scatter plot above. The box plot was created to stratify the volatile acidity data clearly, treating volatile acidity as a categorical rather than a continuous variable. In this plot I created three panels for different levels of volatile acidity. We can clearly see that for lower levels of volatile acidity and higher levels of alcohol, wines are of better quality. This plot further strengthens our understanding of the features of this data set. We can also see that for higher levels of volatile acidity there are no wines of better quality [quality rating 7 and 8], as the box plots for those qualities are absent.
We can see that the blue and green dots [quality rating 7 and 8] are concentrated on the upper right side of the plot. Higher levels of citric acid and alcohol go with better quality wines. Average quality wines have lower levels of both alcohol and citric acid. However, some of the blue and pink dots with low alcohol levels have high citric acid levels; these wines appear to have been rated highly because of their high citric acid.
Higher alcohol and sulphate levels below 1.16 go with wines of better quality. The plots above show that alcohol has a strong influence on wine quality.
We can see that there is no clear pattern in the scatter plot above.
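Turning volatile acidity into a three-level categorical variable, as described above, can be done with `cut()`. The break points (0.4 and 0.8) and the level names are hypothetical, since the report does not state where the three panels were split. The values are the first ten rows from the `str()` output.

```r
# First ten volatile-acidity values from the str() output earlier in this report.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Hypothetical break points: (0, 0.4], (0.4, 0.8] and above 0.8.
va_level <- cut(volatile_acidity,
                breaks = c(0, 0.4, 0.8, Inf),
                labels = c("low", "medium", "high"))

table(va_level)   # number of wines in each level
```

The resulting factor can then be used as the panel (facet) variable, with one alcohol-by-quality box plot per level.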
Yes, there are some surprising relationships. Some of the better quality wines with low alcohol levels have high levels of citric acid. The freshness added by citric acid may be compensating for the low alcohol.
No models were created from the given dataset.
Most of the wines are of average quality: about 82% have a rating of 5 or 6. About 4% are of the worst quality [rating 3 and 4] and about 14% are of better quality [rating 7 and 8]. There are few wines of the best and worst quality. The analysis would have been easier if the samples had been distributed more evenly across the quality ratings.
Box plots have been useful in understanding the features that affect wine quality. Alcohol seems to have a major impact: the box plot clearly shows that wine quality improves as alcohol increases.
Volatile acidity and alcohol are the two most important features of this data set. We can see a concentration of blue and green dots towards the lower right side of the plot. This indicates that lower volatile acidity and higher alcohol go with better quality wines.
I carried out an exploratory data analysis of the red wine data set. The data set has 1599 observations and 12 variables: one outcome variable, the quality of the wine, and 11 input/predictor variables.
In the univariate analysis I investigated the individual variables. Most of the wines are of average quality. Most of the features are positively skewed with long-tailed distributions.
In the bivariate analysis we saw that alcohol, volatile acidity, sulphates, chlorides, pH and citric acid have a major influence on wine quality. Fixed acidity, density, residual sugar, free sulfur dioxide and total sulfur dioxide have no major influence on wine quality.
The data set has 1599 observations, but wine quality is not equally distributed across the ratings. There are more average quality wines than better and worse quality wines. Differentiating bad wines from better ones would have been easier if the quality ratings had been more evenly distributed.
In future, the cost of wines could be included. It will be interesting to see how wine quality relates to price. A designed experiment could also be carried out, fixing some of the input or predictor variables and varying one or two others to understand their impact on wine quality.